Distributional learning theory, or the learning of probability distributions, is a framework in computational learning theory. It was proposed by Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994〔M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie ''On the Learnability of Discrete Distributions''. ACM Symposium on Theory of Computing, 1994 ()〕 and was inspired by the PAC framework introduced by Leslie Valiant.〔(L. Valiant ''A theory of the learnable''. Communications of ACM, 1984 )〕 In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples were drawn. Because of its generality, this framework has been used in a large variety of fields such as machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results of the framework from the theory-of-computation point of view.

==Basic Definitions==

Let X be the support of the distributions of interest. As in the original work of Kearns et al.,〔 if X is finite it can be assumed without loss of generality that X = {0,1}^n, where n is the number of bits that have to be used in order to represent any y ∈ X. The focus is on probability distributions over X. There are two possible representations of a probability distribution D over X.

* '''probability distribution function (or evaluator)''': an evaluator EVAL(D) for D takes as input any y ∈ X and outputs a real number EVAL(D)(y) which denotes the probability of y according to D, i.e. EVAL(D)(y) = Pr[Y = y] if Y ~ D.
* '''generator''': a generator GEN(D) for D takes as input a string of truly random bits and outputs a sample y ∈ X distributed according to D. The generator can be interpreted as a routine that simulates sampling from the distribution D given a sequence of fair coin tosses.
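The two representations can be made concrete with a small sketch. As an illustrative (not canonical) example, take the product distribution over {0,1}^n in which bit i is 1 independently with probability p[i]; the function and variable names below are assumptions of this sketch, not notation from the literature.

```python
import random

def evaluator(y, p):
    """EVAL(D): return Pr[Y = y] for the bit string y under bias vector p."""
    prob = 1.0
    for bit, pi in zip(y, p):
        prob *= pi if bit == 1 else 1.0 - pi
    return prob

def generator(p, rng=random.random):
    """GEN(D): draw one sample, consuming one fair "coin toss" per bit."""
    return tuple(1 if rng() < pi else 0 for pi in p)

p = [0.5, 0.25]
print(evaluator((1, 0), p))  # 0.5 * 0.75 = 0.375
# Sanity check: the evaluator's probabilities over the support sum to 1.
print(sum(evaluator((a, b), p) for a in (0, 1) for b in (0, 1)))
```

Both representations are polynomial here: the evaluator and the generator each run in time linear in n, which is what "polynomial generator/evaluator" requires in the next paragraph.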
A distribution D is said to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time.

Let C_X be a class of distributions over X, that is, C_X is a set such that every D ∈ C_X is a probability distribution with support X. C_X can also be written as C for simplicity.

Before defining learnability it is necessary to define what a good approximation of a distribution D is. There are several ways to measure the distance between two distributions; the three most common are
* Kullback-Leibler divergence
* Total variation distance
* Kolmogorov distance

The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. This means that for any pair of distributions D, D′:

: Kol(D, D′) ≤ TV(D, D′) ≤ √(KL(D ∥ D′)/2),

where the second inequality is Pinsker's inequality. Therefore, for example, if D and D′ are close with respect to Kullback-Leibler divergence then they are also close with respect to all the other distances. The definitions that follow hold for all of these distances, so the symbol d(D, D′) denotes the distance between D and D′ using any one of them. Although learnability of a class of distributions can be defined using any of these distances, applications usually refer to a specific one.

The basic input used to learn a distribution is a number of samples drawn from this distribution. From the computational point of view, the assumption is that such a sample is given in a constant amount of time, i.e. the algorithm has access to an oracle GEN(D) that returns a sample from the distribution D. Sometimes the interest is, apart from the time complexity, in the number of samples that have to be used in order to learn a specific distribution D in a class of distributions C. This quantity is called the sample complexity of the learning algorithm.
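The three distances can be computed directly when the two distributions are given as probability vectors over a common finite, ordered support. A minimal sketch (function names are illustrative):

```python
import math

def kl_divergence(d1, d2):
    """Kullback-Leibler divergence KL(d1 || d2) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(d1, d2) if p > 0)

def total_variation(d1, d2):
    """Total variation distance: half the L1 distance between the vectors."""
    return 0.5 * sum(abs(p - q) for p, q in zip(d1, d2))

def kolmogorov(d1, d2):
    """Kolmogorov distance: maximum gap between the two CDFs."""
    c1 = c2 = gap = 0.0
    for p, q in zip(d1, d2):
        c1 += p
        c2 += q
        gap = max(gap, abs(c1 - c2))
    return gap

d1 = [0.5, 0.3, 0.2]
d2 = [0.4, 0.4, 0.2]
# Kolmogorov <= TV, and TV <= sqrt(KL/2) by Pinsker's inequality.
assert kolmogorov(d1, d2) <= total_variation(d1, d2)
assert total_variation(d1, d2) <= math.sqrt(kl_divergence(d1, d2) / 2)
```

The asserts check the strong-to-weak ordering on a concrete pair: closeness in KL forces closeness in total variation, which forces closeness in Kolmogorov distance, but not conversely.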
To make the problem of distribution learning clearer, consider the problem of supervised learning as defined in.〔Lorenzo Rosasco, Tomaso Poggio, "A Regularization Tour of Machine Learning — MIT-9.520 Lectures Notes" Manuscript, Dec. 2014 ()〕 In this framework of statistical learning theory a training set S = {(x₁, y₁), …, (xₘ, yₘ)} is given, and the goal is to find a target function f that minimizes some loss function, e.g. the square loss function. More formally f = arg min_g E[ℓ(g(x), y)], where ℓ is the loss function, e.g. ℓ(z, y) = (z − y)², and D(x, y) is the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution D_x(y) is known, then the target function has the closed form f(x) = E_{y ~ D_x}[y]. So the set S is a set of samples from the probability distribution D(x, y). The goal of distributional learning theory is to find D given S, which can then be used to find the target function f.

'''Definition of learnability'''

''A class of distributions C is called efficiently learnable if for every ε > 0 and 0 < δ ≤ 1, given access to GEN(D) for an unknown distribution D ∈ C, there exists a polynomial time algorithm A, called the learning algorithm of C, that outputs a generator or an evaluator of a distribution D′ such that''

: Pr[ d(D, D′) ≤ ε ] ≥ 1 − δ.

''If we know that D′ ∈ C then A is called a proper learning algorithm, otherwise A is called an improper learning algorithm.''

In some settings the class of distributions C is a class of well-known distributions that can be described by a set of parameters. For instance, C could be the class of all Gaussian distributions N(μ, σ²). In this case the algorithm A should be able to estimate the parameters μ, σ, and A is then called a parameter learning algorithm. Parameter learning for simple distributions is a very well studied field, known as statistical estimation, and there is a very long bibliography on estimators for different kinds of simple, known distributions. Distributional learning theory, by contrast, deals with learning classes of distributions that have a more complicated description.
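The Gaussian case above admits a trivial parameter learning algorithm: estimate μ and σ by the sample mean and sample standard deviation. The sketch below simulates the oracle GEN(D) with a seeded pseudorandom generator; the function names, seed and sample size are assumptions of this illustration, and with enough samples the estimates land within ε of the true parameters with high probability.

```python
import random
import statistics

def learn_gaussian(gen, m):
    """Draw m samples from the oracle gen and return estimates (mu_hat, sigma_hat)."""
    samples = [gen() for _ in range(m)]
    return statistics.fmean(samples), statistics.stdev(samples)

rng = random.Random(0)               # fixed seed so the sketch is reproducible
gen = lambda: rng.gauss(3.0, 2.0)    # unknown D = N(3, 4) hidden behind GEN(D)
mu_hat, sigma_hat = learn_gaussian(gen, 10_000)
print(mu_hat, sigma_hat)  # close to the true parameters 3.0 and 2.0
```

This is a proper learning algorithm: the output N(mu_hat, sigma_hat²) belongs to the class C of Gaussians, and the sample complexity m controls how small ε can be made for a given confidence 1 − δ.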